feat: per-pick confidence scores + abstention (Phase 2.4) by hallelx2 · Pull Request #21 · hallelx2/vectorless-engine

hallelx2 · 2026-05-27T02:22:13Z

Summary

- The selection JSON schema now accepts a picks: [{id, confidence}] shape carrying per-pick confidence in [0.0, 1.0]. The legacy selected_section_ids shape still parses, so older or weaker models keep working.
- When every confidence falls strictly below retrieval.abstain.below (default 0.4), /v1/query returns an abstention response (sections: [], abstained: true), and /v1/answer skips synthesis entirely and answers with a canonical refusal.
- Successful responses surface a confidences map keyed by section_id. Abstention responses additionally carry abstention_reason, min_confidence_threshold, and candidate_confidences.

Design rationale

- Additive on the strategy boundary. Result.SelectedIDs stays []tree.SectionID; Confidences is a separate map[SectionID]float64 field, omitted from JSON when empty. Callers that don't care about confidence see no API change.
- Strategies never abstain. Each strategy populates Result.Confidences if the model returned them; the abstention decision lives entirely in the API layer (internal/api/server.go). This keeps the strategies pure ("return what the model picked") and confines policy to one place.
- Abstention requires explicit signal. A nil confidence map (legacy LLM response, or new shape with no confidence keys populated) is the "no signal" sentinel. The abstention check returns false for nil / empty maps so older models cannot accidentally trip a refusal.
- "All picks below" semantics. If even one pick scored at-or-above the threshold, the engine has enough signal to surface evidence; abstention is reserved for the case where every candidate is weak. This matches the plan and avoids over-refusal.
- Trace-token absence on abstention. Replay isn't meaningful for an abstention (there's no retrieval result to reproduce), so abstention responses omit trace_token and aren't written to the replay store.
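The "all picks below" and "no signal" rules above can be sketched as a single predicate. This is a minimal illustration, not the code in internal/api/server.go: the `SectionID` type is a stand-in for tree.SectionID, and the argument order `(confidences, threshold, enabled)` is assumed from the sequence diagram in the review walkthrough.

```go
package main

import "fmt"

// SectionID is a stand-in for tree.SectionID (assumed to be string-like).
type SectionID string

// shouldAbstain reports whether every confidence is strictly below the
// threshold. A nil or empty map is the "no signal" sentinel and never
// triggers abstention, so legacy-shape responses fall through.
func shouldAbstain(confidences map[SectionID]float64, below float64, enabled bool) bool {
	if !enabled || len(confidences) == 0 {
		return false
	}
	for _, c := range confidences {
		if c >= below {
			// One pick at or above the threshold is enough evidence.
			return false
		}
	}
	return true
}

func main() {
	allLow := map[SectionID]float64{"sec_a": 0.12, "sec_b": 0.20}
	oneHigh := map[SectionID]float64{"sec_a": 0.82, "sec_b": 0.31}
	fmt.Println(shouldAbstain(allLow, 0.4, true))
	fmt.Println(shouldAbstain(oneHigh, 0.4, true))
	fmt.Println(shouldAbstain(nil, 0.4, true))
}
```

Because the comparison is `c >= below`, a pick scoring exactly 0.4 against the default threshold does not count as weak.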

Opt-out / configuration

Knob	Where	Default
Per-call	`enable_abstain: false` on the `/v1/query` or `/v1/answer` request body	absent (use config)
Server	`retrieval.abstain.enabled: false` in config.yaml	`true` (opt-out)
Env	`VLE_RETRIEVAL_ABSTAIN_ENABLED=false`	unset
Threshold	`retrieval.abstain.below: 0.5` / `VLE_RETRIEVAL_ABSTAIN_BELOW=0.5`	`0.4`
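The precedence between the per-call knob and the server config could look roughly like the sketch below. The `queryRequest` struct and the `resolve` helper are hypothetical; only the `enable_abstain` field name, the `AbstainBlock` name, and the "absent means use config" rule come from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AbstainBlock mirrors the config block named in the PR; the exact field
// set is an assumption.
type AbstainBlock struct {
	Enabled bool
	Below   float64
}

// queryRequest is a hypothetical request body. A *bool distinguishes
// "field absent" (nil) from an explicit true or false.
type queryRequest struct {
	Query         string `json:"query"`
	EnableAbstain *bool  `json:"enable_abstain,omitempty"`
}

// abstentionEnabled lets a per-call value win when present and falls back
// to the server config otherwise.
func abstentionEnabled(override *bool, cfg AbstainBlock) bool {
	if override != nil {
		return *override
	}
	return cfg.Enabled
}

// resolve decodes a request body and applies the precedence rule.
func resolve(body string, cfg AbstainBlock) (bool, error) {
	var req queryRequest
	if err := json.Unmarshal([]byte(body), &req); err != nil {
		return false, err
	}
	return abstentionEnabled(req.EnableAbstain, cfg), nil
}

func main() {
	cfg := AbstainBlock{Enabled: true, Below: 0.4}
	bodies := []string{
		`{"query":"q"}`,
		`{"query":"q","enable_abstain":false}`,
	}
	for _, body := range bodies {
		on, err := resolve(body, cfg)
		fmt.Println(body, on, err)
	}
}
```

Using a pointer rather than a plain bool is what makes "absent (use config)" expressible; a plain bool would silently turn a missing field into `false`.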

Test plan

- go build ./... clean
- go vet ./... clean
- go test ./... all green (all pre-existing tests pass + new coverage)
- New parse-side tests: new-shape, legacy-shape, mixed-shape, clamped, deduped, new-shape-no-confidences
- New strategy tests: SinglePass / ChunkedTree / Agentic surface Confidences, strategies never abstain
- New config tests: defaults, env overrides (enable / disable / parse / edge), bad-input rejection, validation
- New API tests: shouldAbstain predicate, helper sentinels, respondAbstained shape (query + answer), synthesis LLM tripwire (must not be called on abstention path), trace_token absent on abstention

Before / after examples

LLM new-shape response → confidences populated

// LLM returns:
{"picks":[{"id":"sec_a","confidence":0.82},{"id":"sec_b","confidence":0.31}],"reasoning":"x"}

// /v1/query response (no abstention because 0.82 ≥ 0.4):
{
  "document_id": "...",
  "sections": [{"id": "sec_a", ...}, {"id": "sec_b", ...}],
  "confidences": {"sec_a": 0.82, "sec_b": 0.31},
  ...
}

LLM all-low response → abstained

// LLM returns:
{"picks":[{"id":"sec_a","confidence":0.12},{"id":"sec_b","confidence":0.20}]}

// /v1/query response:
{
  "document_id": "...",
  "query": "...",
  "strategy": "chunked-tree",
  "sections": [],
  "abstained": true,
  "abstention_reason": "no candidate section scored above the confidence threshold",
  "min_confidence_threshold": 0.4,
  "candidate_confidences": {"sec_a": 0.12, "sec_b": 0.20}
}

// /v1/answer response (synthesis skipped):
{
  "document_id": "...",
  "answer": "I cannot answer this question from the supplied document.",
  "citations": [],
  "abstained": true,
  ...
}

Mixed-shape response handled

// LLM returns (some picks with confidence, some without):
{"picks":[{"id":"sec_a","confidence":0.9},{"id":"sec_b"},{"id":"sec_c","confidence":0.4}]}

// confidences map surfaces only present scores:
{"confidences": {"sec_a": 0.9, "sec_c": 0.4}, ...}
// sec_b is in sections[] but absent from confidences (no signal for that pick)

Legacy response → no abstention

// LLM returns:
{"selected_section_ids":["sec_a","sec_b"]}

// /v1/query response: NO confidences map, NO abstention check fires.
// Older models continue to work unchanged.
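The four cases above (new shape, all-low, mixed shape, legacy) can be reproduced with a small dual-shape parser. This is a sketch under stated assumptions rather than the PR's ParseSelection: the lowercase `parseSelection` name, the helper types, and the first-occurrence-wins handling of duplicate picks are all illustrative choices.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SectionID is a stand-in for tree.SectionID.
type SectionID string

// pick uses a *float64 so a missing confidence is distinguishable from 0.0.
type pick struct {
	ID         SectionID `json:"id"`
	Confidence *float64  `json:"confidence"`
}

type selection struct {
	Picks    []pick      `json:"picks"`
	Selected []SectionID `json:"selected_section_ids"`
}

// clamp01 forces a score into [0.0, 1.0].
func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// parseSelection returns the picked IDs plus a confidence map. The map is
// nil when no pick carried a confidence (legacy shape, or new shape without
// scores), which is the "no signal" sentinel.
func parseSelection(raw []byte) ([]SectionID, map[SectionID]float64, error) {
	var sel selection
	if err := json.Unmarshal(raw, &sel); err != nil {
		return nil, nil, err
	}
	if len(sel.Picks) == 0 {
		return sel.Selected, nil, nil
	}
	seen := make(map[SectionID]bool)
	var ids []SectionID
	var conf map[SectionID]float64
	for _, p := range sel.Picks {
		if p.ID == "" || seen[p.ID] {
			continue // assumption: the first occurrence of a duplicate wins
		}
		seen[p.ID] = true
		ids = append(ids, p.ID)
		if p.Confidence != nil {
			if conf == nil {
				conf = make(map[SectionID]float64)
			}
			conf[p.ID] = clamp01(*p.Confidence)
		}
	}
	return ids, conf, nil
}

func main() {
	mixed := `{"picks":[{"id":"sec_a","confidence":0.9},{"id":"sec_b"},{"id":"sec_c","confidence":0.4}]}`
	ids, conf, err := parseSelection([]byte(mixed))
	fmt.Println(ids, conf, err)
}
```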

Summary by CodeRabbit

Release Notes

New Features
- Added confidence-driven abstention: when all candidate section confidences fall below a configured threshold, the API returns an abstention response with empty results instead of uncertain answers.
- New enable_abstain parameter on query and answer endpoints for per-request abstention override control.
- Responses now include per-section confidence scores for transparency.
Configuration
- New retrieval.abstain configuration block with enabled toggle and below confidence threshold (default: 0.4, range: 0.0–1.0).

Extend the selection JSON schema to accept either the legacy {selected_section_ids: [...]} shape or the new {picks: [{id, confidence}]} shape with per-pick confidence in [0.0, 1.0]. ParseSelection returns (ids, confidences, err); legacy responses surface confidences=nil so callers can distinguish "no confidence signal" from "all confidences low".

Each strategy plumbs the confidence map through:
- SinglePass fills Result.Confidences from the parsed map, filtered against the post-FilterKnownIDs survivors.
- ChunkedTree unions per-slice confidence maps (max-wins on duplicate IDs across overlapping slices) and filters to the merged ID set.
- Agentic accepts both done-shape variants. The new picks shape surfaces per-pick confidences on the final Result.

Result.SelectedIDs stays []tree.SectionID; the change is purely additive. Callers that don't care about confidence see no API change. The strategy never abstains; the API layer's abstention check (next commit) is the only place "all confidences below threshold" becomes an abstention response.

Tests cover: new-shape parse, legacy-shape parse, mixed-shape parse (some picks with confidence, some without), confidence clamping, duplicate-pick dedup, per-strategy fill, chunked-tree merge, and the agentic done-with-picks path.
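The max-wins union and the filter-to-survivors step described in this commit might be sketched as follows. The function names `mergeMaxWins` and `filterToIDs` are hypothetical, and returning nil for a nil input is an assumption made to keep the "no signal" sentinel intact.

```go
package main

import "fmt"

// SectionID is a stand-in for tree.SectionID.
type SectionID string

// mergeMaxWins unions per-slice confidence maps. When the same section
// appears in more than one slice, the highest score is kept. The result is
// nil when no slice contributed any score.
func mergeMaxWins(slices ...map[SectionID]float64) map[SectionID]float64 {
	var merged map[SectionID]float64
	for _, s := range slices {
		for id, c := range s {
			if merged == nil {
				merged = make(map[SectionID]float64)
			}
			if prev, ok := merged[id]; !ok || c > prev {
				merged[id] = c
			}
		}
	}
	return merged
}

// filterToIDs keeps only the scores whose section survived into ids.
// A nil input stays nil so callers can still tell "no signal" apart.
func filterToIDs(conf map[SectionID]float64, ids []SectionID) map[SectionID]float64 {
	if conf == nil {
		return nil
	}
	out := make(map[SectionID]float64, len(ids))
	for _, id := range ids {
		if c, ok := conf[id]; ok {
			out[id] = c
		}
	}
	return out
}

func main() {
	sliceA := map[SectionID]float64{"sec_a": 0.3, "sec_b": 0.7}
	sliceB := map[SectionID]float64{"sec_b": 0.5, "sec_c": 0.9}
	merged := mergeMaxWins(sliceA, sliceB)
	fmt.Println(merged)
	fmt.Println(filterToIDs(merged, []SectionID{"sec_b", "sec_c"}))
}
```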

…verrides

AbstainBlock carries Enabled + Below (the [0.0, 1.0] confidence threshold below which picks count as "not confident"). When the selection LLM returns explicit per-pick confidence and EVERY pick falls below Below, the API layer surfaces an abstention response instead of pretending the document held an answer.

Defaults: Enabled=true (opt-out), Below=0.4.

Env overrides: VLE_RETRIEVAL_ABSTAIN_ENABLED (truthy/falsy), VLE_RETRIEVAL_ABSTAIN_BELOW (float in [0,1]).

Validation rejects out-of-range Below values; bad env strings preserve the default rather than zeroing the field.

Tests cover defaults, env overrides (enable/disable/parse), edge cases (0.0, 1.0 inclusive), bad-input rejection, and validation.
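One way to get the "bad env strings preserve the default" behaviour is to assign a field only after a successful parse and leave range checking to a separate validation step. The sketch below assumes exactly that split. `defaultAbstain`, `applyAbstainEnv`, and `Validate` are hypothetical names, and using strconv.ParseBool for the truthy/falsy parse is an assumption about which spellings are accepted.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// AbstainBlock mirrors the config block named in the commit message.
type AbstainBlock struct {
	Enabled bool
	Below   float64
}

// defaultAbstain returns the documented defaults: enabled, threshold 0.4.
func defaultAbstain() AbstainBlock {
	return AbstainBlock{Enabled: true, Below: 0.4}
}

// applyAbstainEnv overlays environment values on cfg. A value that fails to
// parse leaves the existing field untouched instead of zeroing it. The
// lookup parameter has the same signature as os.LookupEnv.
func applyAbstainEnv(cfg AbstainBlock, lookup func(string) (string, bool)) AbstainBlock {
	if v, ok := lookup("VLE_RETRIEVAL_ABSTAIN_ENABLED"); ok {
		if b, err := strconv.ParseBool(v); err == nil {
			cfg.Enabled = b
		}
	}
	if v, ok := lookup("VLE_RETRIEVAL_ABSTAIN_BELOW"); ok {
		if f, err := strconv.ParseFloat(v, 64); err == nil {
			cfg.Below = f
		}
	}
	return cfg
}

// Validate rejects thresholds outside [0.0, 1.0]; both bounds are allowed.
// The negated form also rejects NaN.
func (a AbstainBlock) Validate() error {
	if !(a.Below >= 0 && a.Below <= 1) {
		return fmt.Errorf("retrieval.abstain.below must be in [0.0, 1.0], got %v", a.Below)
	}
	return nil
}

func main() {
	cfg := applyAbstainEnv(defaultAbstain(), os.LookupEnv)
	fmt.Printf("%+v valid=%v\n", cfg, cfg.Validate() == nil)
}
```

Passing the lookup function as a parameter keeps the overlay testable without touching the process environment.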

When the selection LLM returns per-pick confidences and every pick falls strictly below retrieval.abstain.below (default 0.4), the API layer skips the normal path and returns an abstention response:
- /v1/query → sections: [], abstained: true, abstention_reason, min_confidence_threshold, candidate_confidences
- /v1/answer → answer: "I cannot answer this question from the supplied document.", citations: [], same abstention fields, synthesis LLM call skipped entirely (planning + retrieval usage carried through)

The "all picks below" semantics is deliberate: if even one section scored at-or-above the threshold the engine surfaces it as evidence. Abstention is reserved for the case where every candidate is weak.

Abstention requires explicit confidence signal: legacy-shape LLM responses (no confidence map) always fall through to the normal path.

Per-request `enable_abstain` body field overrides the server config; opt out globally via retrieval.abstain.enabled: false.

Other changes:
- Result.Confidences threads through the Decomposer (multi-hop plans union confidences max-wins on overlap).
- Successful (non-abstained) responses surface a `confidences` map on the wire when the model returned them.
- Abstention responses carry no trace_token, since there is no retrieval result to replay.
- cmd/engine wires cfg.Retrieval.Abstain into the Deps.

Tests cover: shouldAbstain predicate (all-below, one-above, boundary, nil/empty); filterConfidencesToIDs sentinel preservation; stringKeyedConfidences conversion; abstentionEnabled body-override precedence; respondAbstained / respondAbstainedAnswer shape; synthesis tripwire (LLM must not be called on abstention path); trace_token absence on abstention.

OpenAPI:
- enable_abstain on QueryRequest + AnswerRequest.
- abstained, abstention_reason, min_confidence_threshold, candidate_confidences, confidences on both response schemas.
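The /v1/query abstention body listed in this commit could be shaped by a struct like the one below. The JSON field names and the reason string come from the PR's examples; the struct itself, `newAbstainedQuery`, `stringKeyed`, and `wireFields` are hypothetical and only illustrate that sections serializes as an empty array and that no trace_token key is emitted.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SectionID is a stand-in for tree.SectionID.
type SectionID string

// abstainedQueryResponse is a hypothetical wire shape. It deliberately has
// no trace_token field.
type abstainedQueryResponse struct {
	DocumentID             string             `json:"document_id"`
	Query                  string             `json:"query"`
	Strategy               string             `json:"strategy"`
	Sections               []any              `json:"sections"`
	Abstained              bool               `json:"abstained"`
	AbstentionReason       string             `json:"abstention_reason"`
	MinConfidenceThreshold float64            `json:"min_confidence_threshold"`
	CandidateConfidences   map[string]float64 `json:"candidate_confidences"`
}

// stringKeyed converts SectionID keys to plain strings for the wire.
func stringKeyed(conf map[SectionID]float64) map[string]float64 {
	out := make(map[string]float64, len(conf))
	for id, c := range conf {
		out[string(id)] = c
	}
	return out
}

// newAbstainedQuery builds the abstention body. Sections is a non-nil empty
// slice so it marshals as [] rather than null.
func newAbstainedQuery(docID, query, strategy string, below float64, conf map[SectionID]float64) abstainedQueryResponse {
	return abstainedQueryResponse{
		DocumentID:             docID,
		Query:                  query,
		Strategy:               strategy,
		Sections:               []any{},
		Abstained:              true,
		AbstentionReason:       "no candidate section scored above the confidence threshold",
		MinConfidenceThreshold: below,
		CandidateConfidences:   stringKeyed(conf),
	}
}

// wireFields round-trips the response through JSON to expose the emitted keys.
func wireFields(resp abstainedQueryResponse) (map[string]any, error) {
	raw, err := json.Marshal(resp)
	if err != nil {
		return nil, err
	}
	var fields map[string]any
	if err := json.Unmarshal(raw, &fields); err != nil {
		return nil, err
	}
	return fields, nil
}

func main() {
	conf := map[SectionID]float64{"sec_a": 0.12, "sec_b": 0.20}
	resp := newAbstainedQuery("doc_1", "q", "chunked-tree", 0.4, conf)
	raw, _ := json.MarshalIndent(resp, "", "  ")
	fmt.Println(string(raw))
}
```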

coderabbitai · 2026-05-27T02:22:24Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR implements confidence-driven abstention across the retrieval engine. Selection LLMs now return per-section confidence scores alongside selected IDs, which flow through all retrieval strategies and decomposer. The API evaluates whether all confidences fall below a configurable threshold, and if so, returns an abstention response (empty sections/answer) instead of weak grounding, with per-request override support.

Changes

Confidence-Driven Abstention

Layer / File(s)	Summary
Abstention Configuration `pkg/config/config.go`, `pkg/config/config_test.go`, `config.example.yaml`	`AbstainBlock` type with `enabled` toggle and `below` threshold (0.0–1.0); defaults to enabled at 0.4; environment overrides and validation enforce bounds.
Retrieval Result & Selection Parser `pkg/retrieval/strategy.go`, `pkg/retrieval/single_pass.go`, `pkg/retrieval/retrieval_test.go`	`Result.Confidences` map added (omitted when absent). Selection parser updated to return `(ids, confidences, error)` and support new `picks` JSON shape with per-ID confidence (clamped, deduped) alongside legacy `selected_section_ids` fallback.
Single-Pass Strategy Confidence Flow `pkg/retrieval/single_pass.go`	Selection prompt and schema extended to prefer `picks` format with confidence constraints. `runSelectionWithRetry` returns confidences; parsing logic normalizes dual-format responses and filters confidences to final deduplicated IDs.
Chunked-Tree Multi-Slice Confidence Merge `pkg/retrieval/chunked_tree.go`	Per-slice result struct includes confidence map; slice goroutines capture confidences; merge stage unions confidences across slices using max-wins rule per section ID.
Agentic Strategy Confidence Picks `pkg/retrieval/agentic.go`, `pkg/retrieval/agentic_test.go`	`done` action now accepts `picks` array with per-ID confidence (preferred format, clamped to [0.0, 1.0]) or falls back to legacy `picked_ids`. System prompt and action protocol instruct model on confidence scoring. Result includes filtered confidences.
Decomposer Multi-Hop Confidence Union `pkg/retrieval/decompose.go`	`DecomposedSelect` delegates to new `DecomposedSelectWithConfidences`; multi-hop execution unions per-sub-question confidences (max-wins per section) and returns (ids, confidences, usage) while preserving first-seen ID order.
API Server Abstention Decision & Shaping `internal/api/server.go`	`Deps` struct adds `Abstain` config; query/answer request bodies accept `enable_abstain` override. Selection refactored to return (ids, confidences, usage). When enabled and all confidences below threshold, routes to abstention response (empty sections/citations, `abstained=true`, refusal text for answer, omitted trace_token) instead of proceeding to re-rank/synthesis. Confidence maps filtered to final IDs and included in responses when present.
API Abstention Tests & OpenAPI Spec `internal/api/abstention_test.go`, `openapi.yaml`	Tests verify threshold logic, confidence filtering, request override semantics, response shapes (abstained flag, reason, threshold value, empty sections, candidate/final confidences), and that abstention skips LLM synthesis. OpenAPI documents `enable_abstain` parameter, confidence and abstention fields, refusal semantics, and trace_token behavior.
Engine Integration `cmd/engine/main.go`	Configured `Retrieval.Abstain` wired into `api.Deps` for request handling.

Sequence Diagram(s)

sequenceDiagram
  participant HTTPRequest
  participant handleQuery
  participant runSelection
  participant shouldAbstain
  participant respondAbstained
  HTTPRequest->>handleQuery: enable_abstain override + query
  handleQuery->>runSelection: retrieve sections + confidences
  runSelection-->>handleQuery: (selectedIDs, confidences, usage)
  handleQuery->>shouldAbstain: (confidences, threshold, enabled)
  shouldAbstain-->>handleQuery: all below threshold?
  alt abstain
    handleQuery->>respondAbstained: shape abstention response
    respondAbstained-->>HTTPRequest: 200 OK, abstained=true, empty sections
  else continue
    handleQuery->>handleQuery: proceed to re-rank/synthesis
    handleQuery-->>HTTPRequest: 200 OK, sections/answer with confidences
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: per-pick confidence scores + abstention (Phase 2.4)' accurately summarizes the two main features added: per-pick confidence scores from selection LLM and an API-layer abstention mechanism based on confidence thresholds.
Docstring Coverage	✅ Passed	Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

hallelx2 added 3 commits May 27, 2026 03:08

Copilot AI review requested due to automatic review settings May 27, 2026 02:22

sourcery-ai Bot reviewed May 27, 2026

Copilot started reviewing on behalf of hallelx2 May 27, 2026 02:22

hallelx2 merged commit eac87c6 into main May 27, 2026
5 of 9 checks passed

hallelx2 deleted the feat/confidence-and-abstention branch May 27, 2026 02:24

Copilot AI reviewed May 27, 2026

hallelx2 merged 3 commits into main from feat/confidence-and-abstention
